{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "62ac6c38",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/low_level/retrieval.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c919f307-07b1-41bd-bc5d-51edd8677983",
   "metadata": {},
   "source": [
    "# Building Retrieval from Scratch\n",
    "\n",
    "In this tutorial, we show you how to build a standard retriever against a vector database, that will fetch nodes via top-k similarity.\n",
    "\n",
    "We use Pinecone as the vector database. We load in nodes using our high-level ingestion abstractions (to see how to build this from scratch, see our previous tutorial!).\n",
    "\n",
    "We will show how to do the following:\n",
    "1. How to generate a query embedding\n",
    "2. How to query the vector database using different search modes (dense, sparse, hybrid)\n",
    "3. How to parse results into a set of Nodes\n",
    "4. How to put this in a custom retriever"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bcb486eb-c0b8-40e2-9038-da97aef63139",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "We build an empty Pinecone Index, and define the necessary LlamaIndex wrappers/abstractions so that we can start loading data into Pinecone. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0823f361",
   "metadata": {},
   "source": [
    "If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a694cf9",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index-readers-file pymupdf\n",
    "%pip install llama-index-vector-stores-pinecone\n",
    "%pip install llama-index-embeddings-openai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7fcad439",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install llama-index"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22fb9e0a-566b-4f34-b9cf-72193cb51adb",
   "metadata": {},
   "source": [
    "#### Build Pinecone Index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cc739b4d-491f-406d-a0e6-f6b1e8c126dc",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pinecone import Pinecone, Index, ServerlessSpec\n",
    "import os\n",
    "\n",
    "api_key = os.environ[\"PINECONE_API_KEY\"]\n",
    "pc = Pinecone(api_key=api_key)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20ba2f76-29d8-4dc5-b25c-64dcfe9e8d23",
   "metadata": {},
   "outputs": [],
   "source": [
    "# dimensions are for text-embedding-ada-002\n",
    "dataset_name = \"quickstart\"\n",
    "if dataset_name not in pc.list_indexes().names():\n",
    "    pc.create_index(\n",
    "        dataset_name,\n",
    "        dimension=1536,\n",
    "        metric=\"euclidean\",\n",
    "        spec=ServerlessSpec(cloud=\"aws\", region=\"us-east-1\"),\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "45f9a999-dac2-4bc8-8133-ccc851b76a6d",
   "metadata": {},
   "outputs": [],
   "source": [
    "pinecone_index = pc.Index(dataset_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3216f9e2-946d-4b43-8b8c-acf6788633a5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# [Optional] drop contents in index\n",
    "pinecone_index.delete(deleteAll=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "89246384-983c-4e2c-ac05-ffdc1d54a594",
   "metadata": {},
   "source": [
    "#### Create PineconeVectorStore\n",
    "\n",
    "Simple wrapper abstraction to use in LlamaIndex. Wrap in StorageContext so we can easily load in Nodes. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "775aabb2-3dd2-44b1-b6b9-2f7326409e10",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.vector_stores.pinecone import PineconeVectorStore"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "15f0aa46-9f5b-42c1-9374-db94781363f0",
   "metadata": {},
   "outputs": [],
   "source": [
    "vector_store = PineconeVectorStore(pinecone_index=pinecone_index)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "be9437a0-3d52-4586-8217-43944971a2cc",
   "metadata": {},
   "source": [
    "#### Load Documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "48739cfc-c05a-420a-8c78-280892f8d7a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir data\n",
    "!wget --user-agent \"Mozilla\" \"https://arxiv.org/pdf/2307.09288.pdf\" -O \"data/llama2.pdf\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "079666c5-0685-413d-a765-17f71ae89506",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "from llama_index.readers.file import PyMuPDFReader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4eee7692-2188-4552-9f2e-cb90ac6b7678",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = PyMuPDFReader()\n",
    "documents = loader.load(file_path=\"./data/llama2.pdf\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f1c3a10-3880-48e7-be1d-587089209420",
   "metadata": {},
   "source": [
    "#### Load into Vector Store\n",
    "\n",
    "Load in documents into the PineconeVectorStore. \n",
    "\n",
    "**NOTE**: We use high-level ingestion abstractions here, with `VectorStoreIndex.from_documents.` We'll refrain from using `VectorStoreIndex` for the rest of this tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3d9e4b7e-1815-4ced-bd34-b57060dc1920",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import VectorStoreIndex\n",
    "from llama_index.core.node_parser import SentenceSplitter\n",
    "from llama_index.core import StorageContext"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "42617bc6-ef87-417c-800e-45e6709f8d50",
   "metadata": {},
   "outputs": [],
   "source": [
    "splitter = SentenceSplitter(chunk_size=1024)\n",
    "storage_context = StorageContext.from_defaults(vector_store=vector_store)\n",
    "index = VectorStoreIndex.from_documents(\n",
    "    documents, transformations=[splitter], storage_context=storage_context\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9ee3336a-41cd-4a0a-9e6e-c5aef46288a9",
   "metadata": {},
   "source": [
    "## Define Vector Retriever\n",
    "\n",
    "Now we're ready to define our retriever against this vector store to retrieve a set of nodes.\n",
    "\n",
    "We'll show the processes step by step and then wrap it into a function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "772788f0-ab8b-4ff7-9c22-08722118bc7a",
   "metadata": {},
   "outputs": [],
   "source": [
    "query_str = \"Can you tell me about the key concepts for safety finetuning\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4816f89e-494c-42c5-bed2-61755360cd13",
   "metadata": {},
   "source": [
    "### 1. Generate a Query Embedding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "12f2239e-6cac-4775-a7fd-15ecefe47561",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.embeddings.openai import OpenAIEmbedding\n",
    "\n",
    "embed_model = OpenAIEmbedding()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1f7f3241-0512-48d4-ad7f-74f28e5aeeff",
   "metadata": {},
   "outputs": [],
   "source": [
    "query_embedding = embed_model.get_query_embedding(query_str)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "931c9933-9626-431f-a6d9-9079924ed5ee",
   "metadata": {},
   "source": [
    "### 2. Query the Vector Database\n",
    "\n",
    "We show how to query the vector database with different modes: default, sparse, and hybrid.\n",
    "\n",
    "We first construct a `VectorStoreQuery` and then query the vector db."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ba96b44e-147c-4eae-9eee-396d9b3232ce",
   "metadata": {},
   "outputs": [],
   "source": [
    "# construct vector store query\n",
    "from llama_index.core.vector_stores import VectorStoreQuery\n",
    "\n",
    "query_mode = \"default\"\n",
    "# query_mode = \"sparse\"\n",
    "# query_mode = \"hybrid\"\n",
    "\n",
    "vector_store_query = VectorStoreQuery(\n",
    "    query_embedding=query_embedding, similarity_top_k=2, mode=query_mode\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b29a2a78-afef-4cb9-bc43-8cfb0e1863e4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# returns a VectorStoreQueryResult\n",
    "query_result = vector_store.query(vector_store_query)\n",
    "query_result"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a6ff5515-dced-4ff3-933a-230478b7b7c3",
   "metadata": {},
   "source": [
    "### 3. Parse Result into a set of Nodes\n",
    "\n",
    "The `VectorStoreQueryResult` returns the set of nodes and similarities. We construct a `NodeWithScore` object with this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b4d03132-810f-4ef6-bc48-5f1e077ee40e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.schema import NodeWithScore\n",
    "from typing import Optional\n",
    "\n",
    "nodes_with_scores = []\n",
    "for index, node in enumerate(query_result.nodes):\n",
    "    score: Optional[float] = None\n",
    "    if query_result.similarities is not None:\n",
    "        score = query_result.similarities[index]\n",
    "    nodes_with_scores.append(NodeWithScore(node=node, score=score))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "48a73497-2912-41ba-ba12-3d24f7acf78d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.response.notebook_utils import display_source_node\n",
    "\n",
    "for node in nodes_with_scores:\n",
    "    display_source_node(node, source_length=1000)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36563225-9265-4000-b1c5-55d7299f6f2c",
   "metadata": {},
   "source": [
    "### 4. Put this into a Retriever\n",
    "\n",
    "Let's put this into a Retriever subclass that can plug into the rest of LlamaIndex workflows!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5300799-0b44-4c11-9219-a9473fe5d3b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import QueryBundle\n",
    "from llama_index.core.retrievers import BaseRetriever\n",
    "from typing import Any, List\n",
    "\n",
    "\n",
    "class PineconeRetriever(BaseRetriever):\n",
    "    \"\"\"Retriever over a pinecone vector store.\"\"\"\n",
    "\n",
    "    def __init__(\n",
    "        self,\n",
    "        vector_store: PineconeVectorStore,\n",
    "        embed_model: Any,\n",
    "        query_mode: str = \"default\",\n",
    "        similarity_top_k: int = 2,\n",
    "    ) -> None:\n",
    "        \"\"\"Init params.\"\"\"\n",
    "        self._vector_store = vector_store\n",
    "        self._embed_model = embed_model\n",
    "        self._query_mode = query_mode\n",
    "        self._similarity_top_k = similarity_top_k\n",
    "        super().__init__()\n",
    "\n",
    "    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:\n",
    "        \"\"\"Retrieve.\"\"\"\n",
    "        if query_bundle.embedding is None:\n",
    "            query_embedding = self._embed_model.get_query_embedding(\n",
    "                query_bundle.query_str\n",
    "            )\n",
    "        else:\n",
    "            query_embedding = query_bundle.embedding\n",
    "\n",
    "        vector_store_query = VectorStoreQuery(\n",
    "            query_embedding=query_embedding,\n",
    "            similarity_top_k=self._similarity_top_k,\n",
    "            mode=self._query_mode,\n",
    "        )\n",
    "        query_result = self._vector_store.query(vector_store_query)\n",
    "\n",
    "        nodes_with_scores = []\n",
    "        for index, node in enumerate(query_result.nodes):\n",
    "            score: Optional[float] = None\n",
    "            if query_result.similarities is not None:\n",
    "                score = query_result.similarities[index]\n",
    "            nodes_with_scores.append(NodeWithScore(node=node, score=score))\n",
    "\n",
    "        return nodes_with_scores"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "33ecc1c4-99f1-4826-9005-ed3befbff08e",
   "metadata": {},
   "outputs": [],
   "source": [
    "retriever = PineconeRetriever(\n",
    "    vector_store, embed_model, query_mode=\"default\", similarity_top_k=2\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "53739f5c-23c6-4329-adc5-dcd27d2c878d",
   "metadata": {},
   "outputs": [],
   "source": [
    "retrieved_nodes = retriever.retrieve(query_str)\n",
    "for node in retrieved_nodes:\n",
    "    display_source_node(node, source_length=1000)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a8c77b72-3085-47c5-81cf-383821b3a430",
   "metadata": {},
   "source": [
    "## Plug this into our RetrieverQueryEngine to synthesize a response\n",
    "\n",
    "**NOTE**: We'll cover more on how to build response synthesis from scratch in future tutorials! "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "87797531-8aa7-4b66-bb66-ce7e2f06f870",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.query_engine import RetrieverQueryEngine\n",
    "\n",
    "query_engine = RetrieverQueryEngine.from_args(retriever)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "77c43b5d-ab9b-4e18-98b7-e4b38521665d",
   "metadata": {},
   "outputs": [],
   "source": [
    "response = query_engine.query(query_str)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a8dd8613-7a5e-4ce5-b49b-bd2b4ebf977b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The key concepts for safety fine-tuning include supervised safety fine-tuning, safety RLHF (Reinforcement Learning from Human Feedback), and safety context distillation. Supervised safety fine-tuning involves gathering adversarial prompts and safe demonstrations to train the model to align with safety guidelines. Safety RLHF integrates safety into the RLHF pipeline by training a safety-specific reward model and gathering challenging adversarial prompts for fine-tuning. Safety context distillation refines the RLHF pipeline by generating safer model responses using a safety preprompt and fine-tuning the model on these responses without the preprompt. These concepts are used to mitigate safety risks and improve the safety of the model's responses.\n"
     ]
    }
   ],
   "source": [
    "print(str(response))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama_index_v2",
   "language": "python",
   "name": "llama_index_v2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
